feat(ai): add 50-token overlap between embedding splits #35367
Conversation
Now overlaps 15-20%.

⏺ The change is in EmbeddingsRunner.java:62-101. Here's what changed:

- Before: on each split, the buffer was cleared entirely — the new chunk started with only the sentence that triggered the overflow.
- After: on each split, the code walks backward through the sentence list to collect trailing sentences totaling ≥50 tokens, carries those into the next chunk as overlap, then resets totalTokens to the overlap size.

ref: #35366
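The backward walk described above can be sketched roughly as follows. This is a minimal standalone sketch, not the actual EmbeddingsRunner code; ChunkOverlap, computeOverlapStart, startNextChunk, and OVERLAP_TOKENS are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the overlap computation: walk backward through
// the per-sentence token counts until the collected trailing sentences
// total at least 50 tokens, then carry those sentences into the next chunk.
public class ChunkOverlap {
    static final int OVERLAP_TOKENS = 50;

    // Returns the index of the first sentence to carry forward as overlap.
    // The loop stops once the trailing sentences reach the token target,
    // or when the whole buffer has been consumed.
    static int computeOverlapStart(List<Integer> tokenCounts) {
        int overlapCount = 0;
        int overlapStart = tokenCounts.size();
        while (overlapStart > 0 && overlapCount < OVERLAP_TOKENS) {
            overlapStart--;
            overlapCount += tokenCounts.get(overlapStart);
        }
        return overlapStart;
    }

    // Seed the next chunk's buffer with the trailing overlap sentences;
    // the caller would reset its running token total to the overlap size.
    static List<String> startNextChunk(List<String> sentences, List<Integer> tokenCounts) {
        int start = computeOverlapStart(tokenCounts);
        return new ArrayList<>(sentences.subList(start, sentences.size()));
    }
}
```

With 30-token sentences, two trailing sentences (60 tokens) are carried forward, since one sentence (30 tokens) falls short of the 50-token target.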
Claude finished @wezell's task in 1m 9s

Code Review
Three issues worth addressing before merge:

1. Config inconsistency —
2. Missing guard on the backward-overlap loop:
   while (overlapStart > 0 && overlapCount < overlapTokens) {
   If someone sets the overlap to 0 or a negative value, the loop body never runs and no overlap is carried.
3. A single oversized sentence can silently blow past the overlap target. The while loop exits when overlapCount reaches overlapTokens, but only after the sentence that crossed the threshold has already been counted, so one very long sentence can carry far more than 50 tokens forward.

One subtlety that's correct but fragile:

No functional bugs found — the core logic (backward traversal for overlap, clearing lists after save) is sound.
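Points 2 and 3 could be addressed with a guarded variant of the loop. This is a hypothetical sketch of one way to do it, not code from the PR; guardedOverlapStart and the 2x cap are illustrative choices.

```java
import java.util.List;

// Hypothetical guarded variant of the backward-overlap loop: treats a
// non-positive overlap setting as "overlap disabled", and stops before a
// single oversized sentence drags the overlap far past the target.
public class GuardedOverlap {
    static int guardedOverlapStart(List<Integer> tokenCounts, int overlapTokens) {
        if (overlapTokens <= 0) {
            // Overlap disabled: carry nothing forward.
            return tokenCounts.size();
        }
        int overlapCount = 0;
        int overlapStart = tokenCounts.size();
        while (overlapStart > 0 && overlapCount < overlapTokens) {
            int next = tokenCounts.get(overlapStart - 1);
            // Refuse a sentence that would push the overlap past twice the
            // target, unless nothing has been collected yet.
            if (overlapCount > 0 && overlapCount + next > 2 * overlapTokens) {
                break;
            }
            overlapStart--;
            overlapCount += next;
        }
        return overlapStart;
    }
}
```

With a 200-token sentence at the head of the buffer and a 30-token tail, only the tail is carried (the 200-token sentence would overshoot the cap); a zero setting carries nothing.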
When a content's sentence stream ended exactly on a chunk boundary, the trailing block re-saved the overlap subset as a standalone chunk — duplicate content that embeddingExists() couldn't detect (it compares full normalized chunk text). Track unsaved content explicitly and only emit the final chunk when new sentences have been added since the last save. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
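The fix described in that commit message can be sketched as a flag on the chunking loop. This is an illustrative standalone model, not the dotCMS implementation; FinalChunkGuard, chunk, and hasUnsaved are assumed names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the boundary fix: track whether any new sentences have been
// added since the last save, and only emit the trailing chunk when they
// have. A stream that ends exactly on a chunk boundary then skips the
// final emit instead of re-saving the carried-over overlap as a duplicate.
public class FinalChunkGuard {
    static List<List<String>> chunk(List<String> sentences, List<Integer> tokens,
                                    int maxTokens, int overlapTokens) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> buf = new ArrayList<>();
        List<Integer> bufTokens = new ArrayList<>();
        int total = 0;
        boolean hasUnsaved = false; // new sentences since the last save?
        for (int i = 0; i < sentences.size(); i++) {
            buf.add(sentences.get(i));
            bufTokens.add(tokens.get(i));
            total += tokens.get(i);
            hasUnsaved = true;
            if (total >= maxTokens) {
                chunks.add(new ArrayList<>(buf));
                // Carry a trailing overlap into the next chunk.
                int start = buf.size(), count = 0;
                while (start > 0 && count < overlapTokens) {
                    count += bufTokens.get(--start);
                }
                buf = new ArrayList<>(buf.subList(start, buf.size()));
                bufTokens = new ArrayList<>(bufTokens.subList(start, bufTokens.size()));
                total = count;
                hasUnsaved = false; // buffer now holds only overlap
            }
        }
        if (hasUnsaved && !buf.isEmpty()) {
            chunks.add(buf); // trailing chunk only when it has new content
        }
        return chunks;
    }
}
```

When the stream ends exactly on a boundary, the buffer holds only the overlap subset and hasUnsaved is false, so no duplicate chunk is emitted; when it ends mid-chunk, the trailing chunk is saved as usual.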
Summary
- Rework EmbeddingsRunner to carry a ~50-token overlap from the end of each chunk into the start of the next
- Replace the StringBuilder buffer with parallel List<String> sentences / List<Integer> tokenCounts lists to enable backward traversal for overlap computation
- Add the ArrayList import

Motivation
Without overlap, context at chunk boundaries is lost — a sentence split across two chunks has neither half with full surrounding context. A 50-token trailing overlap ensures semantic continuity between consecutive embedding chunks.
Test plan
Closes #35366
🤖 Generated with Claude Code